Methods and apparatus for monitoring abnormalities in data stream

ABSTRACT

A technique for monitoring a primary data stream comprising one or more secondary data streams for abnormalities is provided. A deviation value is determined for each of the one or more secondary data streams. The determined deviation values of the one or more secondary data streams are combined to form a combined deviation value. The combined deviation value is used to generate an abnormality signal.

STATEMENT OF GOVERNMENT RIGHTS

This invention was made with government support under Contract No. H98230-04-3-0001 awarded by the Distillery Phase II Program. The government has certain rights in this invention.

FIELD OF THE INVENTION

The present invention is related to techniques for monitoring abnormalities in a data stream and, more particularly, for detecting rare abnormalities in a data stream having similar but spurious patterns of abnormality.

BACKGROUND OF THE INVENTION

In recent years, advances in hardware technology have made it possible to collect large amounts of data in many applications. Typically, a database processing this data is affected by continuous activity over long periods of time, thereby allowing such a database to grow without limit. Examples of such data include supermarket data, multimedia data and telecommunication applications. The volume of data may easily reach millions on a daily basis, and it is often not possible to store it so that standard algorithmic techniques may be applied. Therefore, algorithms designed for such data must take into account the fact that it is not possible to revisit any part of the voluminous data, and that only a single scan of the data is allowed during processing. Data of this type is commonly referred to as a data stream.

Unlike a traditional data source, a stream is a continuous process which requires simultaneous model construction and abnormality reporting. Therefore, it is necessary for a supervision process to work with whatever information is currently available, and to continually update an abnormality detection model as new abnormalities occur.

Considerable research has been conducted in the field of data streams in recent years, see, for example, J. Feigenbaum et al., “Testing and Spot-Checking of Data Streams,” ACM SODA Conference, 2000; J. Fong et al., “An Approximate L^p-Difference Algorithm for Massive Data Streams,” Annual Symposium on Theoretical Aspects in Computer Science, 2000; C. Cortes et al., “Hancock: A Language for Extracting Signatures from Data Streams,” ACM SIGKDD Conference, 2000; S. Guha et al., “Clustering Data Streams,” IEEE FOCS Conference, 2000; and B-K. Yi et al., “Online Data Mining for Co-Evolving Time Sequences,” ICDE Conference, 2000.

Many traditional data mining problems, such as clustering and classification, have recently been re-examined in the context of the data stream environment, see, for example, C. C. Aggarwal et al., “A Framework for Clustering Evolving Data Streams,” VLDB Conference, 2003; P. Domingos et al., “Mining High-Speed Data Streams,” ACM SIGKDD Conference, 2000; and S. Guha et al., “Clustering Data Streams,” IEEE FOCS Conference, 2000.

Abnormality detection is an important problem in the data mining community, see, for example, H. Branding et al., “Rules in an Open System: The Reach Rule System,” First Workshop of Rules in Database Systems, 1993; M. Berndtsson et al., “Issues in Active Real-Time Databases,” Active and Real-Time Databases, pp. 142-157, 1995; T. Lane et al., “An Application of Machine Learning to Anomaly Detection,” Proceedings of the 20th National Information Systems Security Conference, pp. 366-380, 1997; and W. Lee et al., “Learning Patterns from Unix Process Execution Traces for Intrusion Detection,” AAAI Workshop: AI Approaches to Fraud Detection and Risk Management, pp. 50-56, July 1997. However, these models do not address the prediction of rare abnormalities in the presence of many spurious, but similar, patterns.

For example, in stock market monitoring applications, it may be desirable to find patterns in trading activity which are indicative of a possible stock market crash. The stream of data available may correspond to the real time data available on the exchange. While a stock sell-off may be a relatively frequent occurrence, which has similar effects on the data stream, one may wish to have the ability to quickly distinguish the rare crash from a simple sell-off. It may also be desirable to detect particular patterns of trading activity which result in the sell-off of a particular stock, or a particular sector of stocks. A quick detection of such abnormalities is of great value to financial institutions.

In business activity monitoring applications, it may be desirable to find particular aspects of the stream indicative of significant abnormalities in business activity. For example, certain sets of actions of competitor companies may point to the probable occurrence of significant abnormalities in the business. When such abnormalities do occur, it is important to be able to detect them very quickly, so that appropriate measures may be taken.

In medical applications, continuous streams of data from hospitals or pharmacies can be used to detect any abnormal disease outbreaks or biological attacks. Certain diseases caused by biological attacks are often difficult to distinguish from other background diseases. However, it is essential to be able to make such distinguishing judgments in real time in order to create a credible abnormality detection system.

Abnormalities such as disease outbreaks or stock market crashes may occur rarely over long periods of time. However, the value of abnormality detection is highly dependent on the latency of the detection. Most abnormality detection systems are usually coupled with time-critical response mechanisms. Furthermore, because of efficiency considerations, it is possible to examine a data point only once throughout the entire computation. This creates an additional constraint on how abnormality detection algorithms may be designed.

In most situations, data streams may show abnormal behavior for a wide variety of reasons. It is important for an abnormality detection model to be specific in its ability to focus and learn a rare abnormality of a particular type. Furthermore, spurious abnormalities may be significantly more frequent than the rare abnormalities of interest. Such a situation makes the abnormality detection problem even more difficult, since it increases the probability of a false detection.

In many cases, even though multiple kinds of anomalous abnormalities may have similar effects on the individual dimensions, the relevant abnormality of interest may only be distinguished by its relative effect on these dimensions. Therefore, an abnormality detection model needs to be able to quantify such differences.

Since a data stream is likely to change over time, not all features remain equally important for the abnormality detection process. While some features may be more valuable to the detection of an abnormality in a given time period, this characteristic may vary with time. It is important to be able to modify the abnormality detection model appropriately with the evolution of the data stream.

SUMMARY OF THE INVENTION

The present invention provides techniques for monitoring abnormalities in a data stream and, more particularly, for detecting rare abnormalities in a data stream through an algorithm which can handle the aforementioned complexities. Furthermore, the methodologies of the present invention do not require any re-scanning of the data, and are therefore useful for a very fast data stream.

For example, in one aspect of the present invention, a technique for monitoring a primary data stream, comprising one or more secondary data streams, for abnormalities is provided. A deviation value is determined for each of the one or more secondary data streams. The determined deviation values of the one or more secondary data streams are combined to form a combined deviation value. The combined deviation value is used to generate an abnormality signal.

A supervised approach is utilized in which the abnormality detection process learns from the data stream. A considerable level of specificity may be achieved by using the behavior of the combination of multiple secondary data streams which are able to distinguish between different kinds of seemingly similar abnormalities.

A supervised abnormality detection problem is very different from a data stream classification problem in which each record has a label attached to it. In a supervised abnormality detection problem, individual records are unlabelled, and the abnormalities of importance are attached only to particular moments in time. Since individual records do not have class labels, the training of the abnormality detection process is more constrained by the limited available information. Furthermore, the rarity of the abnormality adds an additional level of complexity to the detection process.

The methodologies of the present invention provide an effective method for detecting rare abnormalities in a fast data stream. Since a data stream may contain many different kinds of abnormalities, it is necessary to be able to distinguish their characteristic behavior. Therefore, the methodology is able to distinguish particular kinds of abnormalities by learning subtle differences in how different streams are affected by different abnormalities. The methodology performs statistical analysis on multiple dimensions of the data stream in order to perform the detection. Since the technique is tailored for fast responses to continuous monitoring of processes, the entire framework of the methodology is constructed to facilitate online abnormality monitoring of data streams. Therefore, the process can detect the abnormalities with any amount of historical data, but the accuracy is likely to improve with progression of the data stream, or as more data is received.

These and other objects, features, and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram illustrating a hardware implementation suitable for employing methodologies, according to an embodiment of the present invention;

FIG. 2 is a flow diagram illustrating an abnormality signal generation methodology, according to an embodiment of the present invention;

FIG. 3 is a flow diagram illustrating a statistical deviation computation methodology, according to an embodiment of the present invention; and

FIG. 4 is a flow diagram illustrating a deviation level combination methodology, according to an embodiment of the present invention.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

The following description will illustrate the invention using an exemplary data processing system architecture. It should be understood, however, that the invention is not limited to use with any particular system architecture. The invention is instead more generally applicable to any data processing system in which it is desirable to perform efficient and effective data stream abnormality detection.

As will be illustrated in detail below, the present invention introduces techniques for monitoring abnormalities in a data stream and, more particularly, for detecting rare abnormalities in a data stream having similar but spurious patterns of abnormality.

Referring initially to FIG. 1, a block diagram illustrates a hardware implementation suitable for employing methodologies, according to an embodiment of the present invention. As illustrated, an exemplary system comprises multiple client devices 10 coupled via a large network 20 to a server 30. Server 30 may comprise a central processing unit (CPU) 40 coupled to a main memory 50 and a disk 60. Multiple clients 10 can interact with server 30 over large network 20. It is to be appreciated that network 20 may be a public information network such as, for example, the Internet or World Wide Web; however, clients 10 and server 30 may alternatively be connected via a private network, a local area network, or some other suitable network.

Records from a data stream are received at server 30 from an individual client 10 and stored on disk 60. All computations on the data stream are performed by CPU 40. The corresponding statistical data are stored on disk 60, and are utilized for detection purposes.

A primary data stream may be realized from one or more channels. Essentially, data from different channels are derived from different sources, which contain different kinds of information relevant to the abnormality. Each channel may consist of one or more secondary multidimensional data streams. The secondary data streams are fused together to form the complete primary data stream. For example, for an abnormality detection algorithm in a biological environment, the secondary data streams may consist of a stream for hospital admissions with demographic information, a stream for pharmacy sales, and a stream for attendance records.

Points in time at which the behavior of the primary data stream is monitored are referred to as ticks. The time stamps associated with the ticks are denoted by t(1), t(2), . . . t(k). The ticks and time stamps are distinguished because the behavior of the data stream may not necessarily be monitored at equal intervals of time. It is assumed that the data points arrive only at one of these ticks or time stamps.

The total number of secondary data streams is N, and the set of data points associated with the ith stream at tick k is denoted by Y_(i(k)). The data points in the stream Y_(i(k)) are denoted by y_(i(1)), y_(i(2)), . . . y_(i(k)). It is assumed that for each stream i, the data point y_(i(j)) arrives at the time stamp t(j). The entire feed of N streams at tick k is therefore denoted by Y(k)={Y_(1(k)) . . . Y_(N(k))}.

It is assumed that the time stamps at which the rare abnormalities occur in the primary data stream are denoted by T(1) . . . T(r). These abnormalities may either be the primary abnormalities that are desired to be detected, or they may be spurious, or secondary, abnormalities in the data stream. For each abnormality k at time T(k), a flag(k) is maintained. When this abnormality is a primary abnormality, flag(k) is 1. In addition, Q(k) is also maintained, which is the time stamp of the last reported occurrence of any abnormality.

The value of Q(k) is typically larger than the true time of abnormality occurrence T(k), since the value of Q(k) refers to the abnormality report time, whereas the value of T(k) refers to the occurrence time. The last report time is typically larger than the time of the actual abnormality itself, since the external sources reporting the abnormality would need a lag to verify it. These external sources may use a variety of domain specific methods or simply personal observation to decide on the final verification of abnormality occurrence.

It is assumed that the report of an abnormality is an external input to the algorithm, and is available only after a reasonable lag after the actual occurrence of the abnormality. Clearly, a detection algorithm is useful only if it can report abnormalities before they are independently reported and verified by external sources. Assuming that k(r) abnormalities have occurred till tick r, the sequence {(Q(1), T(1), flag(1)) . . . (Q(k(r)), T(k(r)), flag(k(r)))} until tick r is denoted by the abnormality vector E(r). The length of this sequence depends upon the number of abnormalities which have transpired till tick r.

The abnormality detection algorithm outputs a set of time stamps T*^(1) . . . T*^(n) at which it has predicted the detection. A particular detection T*^(i) is referred to as a true detection, when for some lag threshold maxlag, some time stamp T(j) exists, such that 0 ≤ T*^(i) − T(j) ≤ maxlag. Otherwise, the detection is referred to as a false positive. There is a tradeoff between being able to make a larger number of true detections and the number of false alarms. If the algorithm outputs a larger number of detection time stamps in order to reduce the latency, it is likely to report a greater number of false positives and vice-versa.
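The true-detection criterion above can be illustrated with a minimal Python sketch. The function name `classify_detections` and the sample time stamps are hypothetical and chosen only for illustration; the test applied to each detection is the condition 0 ≤ T*^(i) − T(j) ≤ maxlag stated above.

```python
def classify_detections(detections, occurrences, maxlag):
    """Split detection time stamps into true detections and false positives.

    A detection d is a true detection when some occurrence time T satisfies
    0 <= d - T <= maxlag; otherwise it is a false positive.
    """
    true_detections, false_positives = [], []
    for d in detections:
        if any(0 <= d - t <= maxlag for t in occurrences):
            true_detections.append(d)
        else:
            false_positives.append(d)
    return true_detections, false_positives


# Hypothetical data: abnormalities occurred at times 10 and 45.
true_hits, false_alarms = classify_detections(
    detections=[12, 30, 47], occurrences=[10, 45], maxlag=5
)
print(true_hits, false_alarms)
```

In this illustration the detections at 12 and 47 fall within maxlag of an occurrence, while the detection at 30 does not and is counted as a false positive.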

The supervised abnormality detection algorithm continuously detects abnormalities utilizing the data from the history of previous abnormality occurrences. In addition, a learning phase is triggered once after every reported occurrence of a primary or secondary abnormality in order to update the model. The reporting of an abnormality is an independent external process and is not dependent upon the actual detection of an abnormality by the algorithm. In most practical applications, abnormalities are eventually detected and reported because of a variety of factors such as the actual practical consequences of the abnormality. These report times are often too late to be of practical use for abnormality responses. However, they can always be used to improve the accuracy of the abnormality detection model when required.

Referring now to FIG. 2, a flow diagram illustrates an abnormality signal generation methodology, according to an embodiment of the present invention. The methodology begins at block 200 where the model is initialized at the beginning of the detection process. The abnormality detection phase is performed at each tick.

It is assumed that at the beginning of the stream monitoring process, some amount of historical data is available in order to construct an initial model of abnormality behavior. This historical data consists of the stream and the abnormalities in the past time window at the beginning of the abnormality monitoring process. The initial stream is denoted by Y_(h), and the initial set of abnormalities is denoted by E_(h). Once the abnormality detector has been initialized, the methodology continues in an iterative phase of continuous online monitoring together with occasional offline updating.

In block 210, abnormal statistical deviations for secondary data streams are computed at a given instant. These statistical deviations are from expected values based on historical trends, which is described in more detail in FIG. 3. In block 220, the deviations of the secondary data streams are combined in accordance with statistical weights of the channels, which is described in more detail in FIG. 4. In block 230, an abnormality signal for the data stream is output. The methodology terminates at block 240.

Referring now to FIG. 3, a flow diagram illustrates a deviation computation methodology, according to an embodiment of the present invention. FIG. 3 may be considered a detailed description of block 210 in FIG. 2. The computation of the level of statistical deviations at a given instant is necessary for the abnormality signal determination phase as well as a learning model. The methodology begins at block 300. In block 310, a polynomial approximation is computed. A polynomial regression technique may be used which can compute the statistically expected value of the secondary data stream at a given moment in time. The polynomial regression function may be computed using a least squares error criterion. Thus, in block 320, a predicted value is computed from the polynomial approximation.

Consider the tick r at which the points y_(i(1)) . . . y_(i(r)) have already arrived for stream i. For each l in {1, . . . r}, the technique approximates the data point y_(i(l)) by a polynomial in t(l) of order k. In other words, the data point y_(i(l)) is approximated by the polynomial function f_(i(k,l)), where:

$$f_{i(k,l)} = \sum_{j=0}^{k} a^{l}_{ij} \cdot t(l)^{j}$$

Here, the coefficients of the polynomial function are a^(l)_(i0) . . . a^(l)_(ik). The values of a^(l)_(ij) need to be computed using the actual data points in order to find the best fit. Specifically, the data points within a maximum window history of h₁ are used in order to compute the coefficients of this polynomial function. While these coefficients can be estimated quite simply by using a polynomial fitting technique, not all points are equally important in characterizing this function. This function is used to compute the predicted value.
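As a minimal sketch of the polynomial fit described above, the following Python code estimates the coefficients by ordinary least squares over a recent window and evaluates the fitted polynomial at a later time stamp. The function names, the `window` parameter standing in for the window history, and the sample data are assumptions made for illustration. All points in the window are weighted equally here, which is a simplification, and the normal equations are solved directly, which assumes the window holds more than `order` distinct time stamps.

```python
def fit_polynomial(times, values, order):
    """Return least-squares coefficients a_0..a_order for values ~ sum a_j * t**j."""
    n = order + 1
    # Normal equations (V^T V) a = V^T y, where V[p][j] = times[p] ** j.
    A = [[sum(t ** (row + col) for t in times) for col in range(n)]
         for row in range(n)]
    b = [sum(y * t ** row for t, y in zip(times, values)) for row in range(n)]
    # Forward elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(A[row][c] * coeffs[c] for c in range(row + 1, n))
        coeffs[row] = (b[row] - tail) / A[row][row]
    return coeffs


def predict(coeffs, t):
    """Evaluate the fitted polynomial at time stamp t."""
    return sum(a * t ** j for j, a in enumerate(coeffs))


def predicted_value(times, values, order, window, t_next):
    """Fit on the most recent `window` points and predict the value at t_next."""
    coeffs = fit_polynomial(times[-window:], values[-window:], order)
    return predict(coeffs, t_next)


# Hypothetical stream following y = 2t + 1 at ticks 1..5.
times = [1.0, 2.0, 3.0, 4.0, 5.0]
values = [3.0, 5.0, 7.0, 9.0, 11.0]
expected = predicted_value(times, values, order=1, window=5, t_next=6.0)
print(expected)
```

For the linear sample data the prediction at time 6 is 13 up to floating-point rounding. A weighted fit could be substituted where some points matter more than others.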

The predicted value is then used in order to compute the statistical deviation between the actual and predicted value, which is achieved in block 330 of FIG. 3. Data from previous abnormality occurrences are used in order to create a distinguishing model for the particular kind of abnormality which is being tracked. Building this model for distinguishing different kinds of abnormalities is done offline. The methodology terminates at block 340.
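The text does not spell out how the deviation is normalized. One possible sketch, offered purely as an assumption, expresses the difference between the actual and predicted value in units of the root-mean-square of past prediction errors. The function name `z_number` and its arguments are hypothetical.

```python
import math


def z_number(actual, predicted, past_residuals):
    """Deviation of actual from predicted, scaled by the RMS of past residuals.

    The RMS scaling is an assumed normalization, not taken from the text.
    """
    mean_sq = sum(e * e for e in past_residuals) / len(past_residuals)
    scale = math.sqrt(mean_sq)
    if scale == 0.0:
        # A perfect past fit: any nonzero deviation is unbounded in these units.
        if actual == predicted:
            return 0.0
        return math.copysign(math.inf, actual - predicted)
    return (actual - predicted) / scale


# Hypothetical values: past errors of magnitude 1, current error of 3.
z = z_number(actual=13.0, predicted=10.0, past_residuals=[1.0, -1.0, 1.0, -1.0])
print(z)
```

Scaling by past error makes deviations comparable across streams with different units and noise levels, which matters when they are later combined.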

Referring now to FIG. 4, a flow diagram illustrates a deviation level combination methodology for secondary data streams, according to an embodiment of the present invention. FIG. 4 may be considered a detailed description of block 220 in FIG. 2. In many cases, even when secondary data streams are similarly affected by different kinds of abnormalities, the relative magnitudes of the streams could vary considerably. It is desirable to create a function of the z-numbers z_(i(r)) of the different streams, that is, of their statistical deviation values, which is a “signature” of that particular kind of abnormality. In order to achieve this goal, a new signal is created at each tick, which is a linear combination of the signals from the different secondary data streams. Let α₁ . . . α_(N) be N real coefficients. The following new signal Z(r) is defined in terms of the original signal and the alpha values:

$$Z(r) = \sum_{i=1}^{N} \alpha_{i} \cdot z_{i(r)}$$
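The linear combination defining Z(r) can be written as a short Python sketch. The function name `combined_signal` and the sample coefficients are illustrative assumptions; a coefficient of zero simply removes a stream from the signal.

```python
def combined_signal(alphas, z_values):
    """Return Z(r) = sum_i alpha_i * z_i(r) for the z-numbers at one tick."""
    if len(alphas) != len(z_values):
        raise ValueError("one coefficient is required per secondary stream")
    return sum(a * z for a, z in zip(alphas, z_values))


# Hypothetical coefficients and z-numbers for N = 3 streams at a single tick.
Z_r = combined_signal(alphas=[0.5, 0.0, 2.0], z_values=[4.0, 9.0, 1.5])
print(Z_r)
```

Here the second stream has a large z-number but contributes nothing because its coefficient is zero.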

The methodology begins at block 400. Many of the secondary data streams and their corresponding channels may be noisy and will not have any correlation with the primary abnormality. Such streams and their corresponding channels need to be discarded from the abnormality distinguishing process. In other words, the corresponding values of α_(i) need to be set to zero. The first step is to identify such channels in block 410. In block 420, the irrelevant data stream channels are removed. For each of the time stamps T(j) in {T(1) . . . T(s)} at which an abnormality of interest has occurred, the largest value max_(ij) of z_(i(r)) is found over all r such that T(j) ≤ t(r) ≤ T(j)+maxlag. A stream i is said to be interesting to the abnormality detector when, for each j in {1 . . . s}, the value of max_(ij) is larger than a predefined threshold z_(min). This subset of streams {i₁ . . . i_(w)} in {1 . . . N} is denoted by S.
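The stream selection rule above may be sketched in Python as follows. The function name `select_streams`, the list-of-lists layout for the z-numbers, and the sample values are assumptions for illustration. Treating a primary abnormality with no tick inside its lag window as a failure of the test is also an assumption, since the text does not address that case.

```python
def select_streams(z, ticks, primary_times, maxlag, z_min):
    """Return indices of streams whose peak z-number exceeds z_min after
    every primary abnormality.

    z[i][r] is the z-number of stream i at tick r; ticks[r] is the time stamp t(r).
    """
    selected = []
    for i, series in enumerate(z):
        interesting = True
        for T in primary_times:
            # z-numbers of stream i at ticks with T <= t(r) <= T + maxlag.
            window = [series[r] for r, t in enumerate(ticks) if T <= t <= T + maxlag]
            if not window or max(window) <= z_min:
                interesting = False
                break
        if interesting:
            selected.append(i)
    return selected


# Hypothetical data: three streams, primary abnormalities at times 1 and 5.
ticks = [0, 1, 2, 3, 4, 5, 6, 7]
z = [
    [0.0, 3.0, 4.0, 0.0, 0.0, 5.0, 2.0, 0.0],  # reacts to both abnormalities
    [0.0, 3.0, 0.0, 0.0, 0.0, 0.5, 0.2, 0.0],  # reacts only to the first
    [0.0, 0.0, 0.0, 9.0, 0.0, 0.0, 0.0, 0.0],  # peaks outside both windows
]
S = select_streams(z, ticks, primary_times=[1, 5], maxlag=1, z_min=2.0)
print(S)
```

Only the first stream responds after every primary abnormality, so it alone is kept in S.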

In block 430 statistical weights are computed for each relevant channel. The statistical weights may be computed through a learning algorithm or a least squares error optimization. In block 440, the deviation levels of the secondary data streams from the relevant channels are combined in accordance with the corresponding weighted sums of the relevant channels. Once a small number of streams are selected, which are meaningful for the abnormality detection process, the value of the discrimination vector alpha is found which distinguishes the primary abnormalities from other similar abnormalities. The main idea is to choose alpha in such a way that the value of Z(r) peaks just after the occurrence of each primary abnormality to a much greater extent than any other abnormality.

It is assumed that the time stamps at which all secondary abnormalities have happened within the previous history of h₁ are given by t(i₁) . . . t(i_(l)), whereas the time stamps of the primary abnormality are given by {T(1) . . . T(s)}={t(j₁) . . . t(j_(s))}. For each secondary abnormality i_(k) and each stream j, the maximum value of z_(j(r)) is computed over all values of r such that t(r) in (t(i_(k)), t(i_(k))+maxlag). Let the corresponding time stamp be given by ts*^(k)_(j) for each k in {1 . . . l}. This time stamp is then averaged over all streams which lie in S. Therefore, for each secondary abnormality k, Ts*^(k) = Σ_(j in S) ts*^(k)_(j)/|S| is computed. Similarly, for each occurrence of the primary abnormality, the average time stamp Tp*^(k) is computed for each k in {1 . . . s}. In order for the discrimination between primary and secondary abnormalities to be as high as possible, the difference in the average value of the composite signal at the time stamps of the true and spurious abnormalities must be maximized. The values of alpha are chosen in such a way that the ratio of the signal at the times of the primary abnormalities to the value of the signal at the times of the secondary abnormalities is maximized. With alpha so chosen, the signal at tick r+1 is:

$$ZP(r+1) = \sum_{i \in S} \alpha_{i} \cdot z_{i(r+1)}$$

This value is the signal which is specific to the primary abnormality. The greater this value, the higher the likelihood that a primary abnormality has indeed occurred in the stream. The methodology terminates in block 450.

A primary abnormality is predicted by using a minimum threshold on the value of ZP(r+1). Whenever the value of ZP(r+1) exceeds this threshold, a discrete signal is output which indicates that the abnormality has indeed occurred as shown in block 230 of FIG. 2. The use of higher thresholds on the abnormality detection signature results in a lower number of false positives, but lower detection rates as well as higher lags.
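The thresholding step can be sketched in Python as follows. The function names, the coefficient values, and the sample signal are hypothetical. The sketch computes ZP over the selected streams only and reports the ticks at which the signal exceeds the threshold.

```python
def primary_signal(alphas, z_at_tick, selected):
    """Return ZP = sum over i in S of alpha_i * z_i for the z-numbers at one tick."""
    return sum(alphas[i] * z_at_tick[i] for i in selected)


def alarm_ticks(zp_series, threshold):
    """Return the tick indices at which ZP exceeds the threshold."""
    return [r for r, zp in enumerate(zp_series) if zp > threshold]


# Hypothetical coefficients; only streams 0 and 2 belong to S.
zp = primary_signal(alphas=[1.0, 0.0, 0.5], z_at_tick=[2.0, 7.0, 4.0], selected=[0, 2])

# Hypothetical ZP values over five ticks, checked against two thresholds.
zp_series = [0.2, 1.4, 3.1, 0.9, 2.2]
low = alarm_ticks(zp_series, threshold=1.0)
high = alarm_ticks(zp_series, threshold=2.5)
print(zp, low, high)
```

Raising the threshold from 1.0 to 2.5 reduces the alarms from three ticks to one, which illustrates the tradeoff between false positives and detection rate noted above.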

Although illustrative embodiments of the present invention have been described herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be made by one skilled in the art without departing from the scope or spirit of the invention.

1. A method for monitoring a primary data stream, comprising two or more secondary data streams, for abnormalities, comprising the steps of: determining a deviation value for each of the two or more secondary data streams; and combining the two or more deviation values of the two or more secondary data streams to form a combined deviation value, wherein the combined deviation value is used to generate an abnormality signal.
2. The method of claim 1, wherein the step of determining a deviation value further comprises the steps of: computing a polynomial approximation; computing a predicted value from the polynomial approximation; and computing a deviation value using an actual value and the predicted value.
3. The method of claim 2, wherein, in the step of computing a polynomial approximation, the polynomial approximation comprises a polynomial regression function.
4. The method of claim 3, wherein, in the step of computing a polynomial approximation, the polynomial regression function is computed using a least squares error criterion.
5. The method of claim 1, wherein, in the step of combining the two or more deviation values, the two or more secondary data streams are associated with one or more channels of the primary data stream.
6. The method of claim 5, wherein the step of combining the two or more deviation values further comprises the steps of: determining one or more relevant channels from the one or more channels of the primary data stream; computing a statistical weight for each of the one or more relevant channels; and combining two or more deviation values of the two or more secondary data streams associated with the one or more relevant channels in accordance with one or more corresponding statistical weights.
7. The method of claim 6, wherein, in the step of computing a statistical weight, a learning algorithm is used.
8. The method of claim 6, wherein, in the step of computing a statistical weight, a least squares error optimization is used.
9. The method of claim 1, wherein, in the step of combining the two or more deviation values, the abnormality signal comprises a value relating to the likelihood of an abnormality in the primary data stream.
10. The method of claim 1, wherein the steps of determining a deviation value and combining the two or more deviation values are repeated for each point in time at which the primary data stream is monitored.
11. Apparatus for monitoring a primary data stream, comprising two or more secondary data streams, for abnormalities, comprising: a memory; and at least one processor coupled to the memory and operative to: determine a deviation value for each of the two or more secondary data streams; and combine the two or more deviation values of the two or more secondary data streams to form a combined deviation value, wherein the combined deviation value is used to generate an abnormality signal.
12. The apparatus of claim 11, wherein the operation of determining a deviation value further comprises the operations of: computing a polynomial approximation; computing a predicted value from the polynomial approximation; and computing a deviation value using an actual value and the predicted value.
13. The apparatus of claim 12, wherein, in the operation of computing a polynomial approximation, the polynomial approximation comprises a polynomial regression function.
14. The apparatus of claim 13, wherein, in the operation of computing a polynomial approximation, the polynomial regression function is computed using a least squares error criterion.
15. The apparatus of claim 11, wherein, in the operation of combining the two or more deviation values, the two or more secondary data streams are associated with one or more channels of the primary data stream.
16. The apparatus of claim 15, wherein the operation of combining the two or more deviation values further comprises the operations of: determining one or more relevant channels from the one or more channels of the primary data stream; computing a statistical weight for each of the one or more relevant channels; and combining two or more deviation values of the two or more secondary data streams associated with the one or more relevant channels in accordance with one or more corresponding statistical weights.
17. The apparatus of claim 16, wherein, in the operation of computing a statistical weight, a learning algorithm is used.
18. The apparatus of claim 16, wherein, in the operation of computing a statistical weight, a least squares error optimization is used.
19. The apparatus of claim 11, wherein, in the operation of combining the two or more deviation values, the abnormality signal comprises a value relating to the likelihood of an abnormality in the primary data stream.
20. The apparatus of claim 11, wherein the operations of determining a deviation value and combining the two or more deviation values are repeated for each point in time at which the primary data stream is monitored.
21. An article of manufacture for monitoring a primary data stream, comprising two or more secondary data streams, for abnormalities, comprising a machine readable medium containing one or more programs which when executed implement the steps of: determining a deviation value for each of the two or more secondary data streams; combining the two or more deviation values of the two or more secondary data streams to form a combined deviation value, wherein the combined deviation value is used to generate an abnormality signal.